Skip to content

add module vsan capacity#40

Open
schuetzi99 wants to merge 3 commits into
ConSol-Monitoring:mainfrom
schuetzi99:vsan_capacity
Open

add module vsan capacity#40
schuetzi99 wants to merge 3 commits into
ConSol-Monitoring:mainfrom
schuetzi99:vsan_capacity

Conversation

@schuetzi99

Copy link
Copy Markdown

This additional vsan module checks the capacity. The calculations are done including slack and resync. The programming was done with the help of ChatGPT.

Comment thread checkvsphere/vcmd/vsan.py Outdated
Comment thread checkvsphere/vcmd/vsan.py Outdated
Comment thread checkvsphere/vcmd/vsan.py Outdated
Comment thread checkvsphere/vcmd/vsan.py Outdated
Comment thread checkvsphere/vcmd/vsan.py Outdated
Comment thread checkvsphere/vcmd/vsan.py Outdated
Comment thread checkvsphere/vcmd/vsan.py Outdated
@schuetzi99 schuetzi99 requested a review from datamuc March 25, 2026 08:07
@schuetzi99

Copy link
Copy Markdown
Author

@datamuc Could you please review my solutions? Is there anything else that needs to be changed?

@datamuc

datamuc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

@datamuc Could you please review my solutions? Is there anything else that needs to be changed?

https://developer.broadcom.com/xapis/vsan-management-api/latest/vim.cluster.VsanSpaceUsage.html

slackSpaceB and resyncSpaceB are not there

@datamuc

datamuc commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Your instinct is right — this patch is broken in a way that's especially bad for a monitoring check. Let me confirm the data model against the actual API before I make claims about field names.You're right to distrust it. I pulled the actual VsanSpaceUsage data object definition, and the patch is broken in several ways — the worst being that it fails silently green, which is the most dangerous failure mode a monitoring check can have.

The confirmed VsanSpaceUsage object has exactly these fields: totalCapacityB, freeCapacityB, uncommittedB, spaceOverview, spaceDetail, efficientCapacity, whatifCapacities. Note what's not there.

The real problems, worst first:

  1. slackSpaceB and resyncSpaceB don't exist. There are no such fields anywhere on the space-usage object. Because the code reads them with getattr(capacity, 'slackSpaceB', 0), they silently resolve to 0 forever. So effective_free = free - slack - resync is just free - 0 - 0 = free. The entire headline feature of this check — accounting for slack and resync — does literally nothing. The perfdata always reports slack_gb=0.0, resync_gb=0.0.

  2. QueryVsanManagedStorageSpaceUsage returns an array, not an object. The API returns VsanSpaceUsageWithDatastoreType[]. The code does capacity = ...QueryVsanManagedStorageSpaceUsage(...) and then getattr(capacity, 'totalCapacityB', 0) on the list. That returns the default 0. So total=0, and every downstream calc is guarded by if total > 0, meaning usage_pct=0, effective_free_pct=0 → it reports a happy usage=0% and exits OK. Your cluster could be 99% full and this says fine.

  3. The defensive getattr(..., 0) pattern is the root rot. It's used on every field. That converts "I got the field name wrong" and "I got the return type wrong" — both fatal mistakes — into silent zeros instead of exceptions. For a Nagios/Icinga check, a wrong field name should blow up loudly, not quietly green. This single stylistic choice is why bugs 1 and 2 don't surface during testing.

  4. The try/fallback never reaches the fallback. The intent is "try ManagedStorageSpaceUsage, fall back to QuerySpaceUsage on error." But the primary call succeeds (returns a list), and reading attributes off a list via getattr-with-default doesn't raise — so no exception, so the fallback never fires. Separately, the fallback calls .QuerySpaceUsage(...), but the bound method name is VsanQuerySpaceUsage (that's the WSDL operation name); QuerySpaceUsage would AttributeError. So even if reached, it's wrong.

  5. spaceOverview.usedB is fragile — used capacity is better derived as total - free. With the getattr-0 default, a wrong subfield name again just yields used=0.

  6. Unit typo free/1024*3 in patch 1 (that's (free/1024)*3, not GiB at all). Patch 2 fixes the message string but it shows the overall carelessness. Minor pedantry: 1024**3 is GiB, labeled "GB".

The deeper conceptual issue: "slack space" is legacy pre-7.0U1 terminology and was never a queryable field. Since 7U1 it's Operations Reserve + Host Rebuild Reserve, and — importantly — those reserved byte values are not in VsanSpaceUsage either. They live in the cluster capacity-reserve config/thresholds (VsanVcClusterConfigSystem), or you derive the reserved amount from the threshold percentages. So the patch is trying to read reserve data from an object that has never contained it, under field names that have never existed. The author seems to have pattern-matched to what they wished the API had.

What actually works for a capacity check: call VsanQuerySpaceUsage(cluster=clusterRef), take totalCapacityB and freeCapacityB, compute used = total - free, and alert on usage %. That's reliable. If you genuinely want the reserve-aware "effective free," that's a second call against the config system to get the operations/host-rebuild thresholds, then arithmetic — it can't come from this object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants